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Abstract 

Mattices have become essential data representations for many large-scale problems in data analytics, 
and hence matrix sketching is a critical task. Although much research has focused on improving the 
error/size tradeoff under various sketching paradigms, the many forms of error bounds make these ap¬ 
proaches hard to compare in theory and in practice. This paper attempts to categorize and compare most 
known methods under row-wise streaming updates with provable guarantees, and then to tweak some of 
these methods to gain practical improvements while retaining guarantees. 

For instance, we observe that a simple heuristic iSVD, with no guarantees, tends to outperform 
all known approaches in terms of size/error trade-off. We modify the best performing method with 
guarantees FrequentDirections under the size/error trade-off to match the performance of iSVD and 
retain its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. 
In comparing techniques in the time/error trade-off, techniques based on hashing or sampling tend to 
perform better. In this setting we modify the most studied sampling regime to retain error guarantee but 
obtain dramatic improvements in the time/error trade-off. 

Finally, we provide easy replication of our studies on APT, a new testbed which makes available not 
only code and datasets, but also a computing platform with fixed environmental settings. 


1 Introduction 


Matrix sketching has become a central challenge l[7j 30 38 50 ^ in large-scale data analysis as many large 
data sets including customer recommendations, image databases, social graphs, document feature vectors 
can be modeled as a matrix, and sketching is either a necessary first step in data reduction or has direct 
relationships to core techniques including PCA, LDA, and clustering. 

There are several variants of this problem, but in general the goal is to process an n x d matrix A to 
somehow represent a matrix so || A — or (examining the covariance) || A^A — B \\2 is small. 

In both cases, the best rank-A; approximation A^ can be computed using the singular value decomposition 
(svd); however this takes 0(ndmin(n, d)) time and 0{nd) memory. This is prohibitive for modem appli¬ 
cations which usually desire a small space streaming approach, or even an approach that works in parallel. 
For instance diverse applications receive data in a potentially unbounded and time-varying stream and want 


to maintain some sketch B. Examples of these applications include data feeds from sensor networks 1131, 
financial tickers |18||66| |, on-line auctions fill , network traffic p2l[60l , and telecom call records [ [^ . 


In recent years, extensive work has taken place to improve theoretical bounds in the size of B. Random 
projections (^591 and hashing 1^64| approximate A in iJ as a random linear combination of rows and/or 
columns of A. Column sampling methods pA][29l 30| 32} 35 51 ^ choose a set of columns (and/or rows) 


from A to represent B; the best bounds require multiple passes over the data. We refer readers to recent 
work | |T9l[4T||^ for extensive discussion of various models and error bounds. 


* Thanks to support by NSF CCF-1115677, CCF-1350888, IIS-1251019, and ACI-1443046. 
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Recently Liberty |50| introduced a new technique FrequentDirections (abbreviated FD) which is 
deterministic, achieves the best bounds on the covariance — B'^B \\2 error, the direct error ||j4 — 

^IIf (using B as a projection), and moreover, it seems to greatly outperform the projection, hashing, 
and column sampling techniques in practice. Hence, this problem has seen immense progress in the last 
decade with a wide variety of algorithmic improvements in a variety of models |[7]T^^[^^32 32 35 


[4T][50| |56[|59|; which we review thoroughly in Section]^ and assess comprehensively empirically in Section 

m 

In addition, there is a family of heuristic techniques | [T^ 4^ 44j 48j 57| (which we refer to as iSVD, 
described relative to FD in Section]^, which are used in many practical settings, but are not known to have 
any error guarantees. In fact, we observe (see Section on many real and synthetic data sets that iSVD 
noticeably outperforms FD, yet there are adversarial examples where it fails dramatically. Thus we ask 
(and answer in the affirmative): can one achieve a matrix sketching algorithm that matches the usual-case 
performance of iSVD, and the adversarial-case performance of FD, and error guarantees of FD? 


1.1 Notation and Problem Formalization 

We denote an n x d matrix ^ as a set of n rows as [oi; 02 ;..., an] where each a* is a row of length d. 
Alternatively a matrix V can be written as a set of columns [vi,V 2 -, ■ ■ ■ ,Vd]- We assume d n. We will 
consider streaming algorithms where each element of the stream is a row a* of A. Some of the algorithms 
also work in a more general setting (e.g., allowing deletions or distributed streams). 

The squared Frobenius norm of a matrix A is defined || A|||, = ll®i IP where ||aj || is Euclidean norm 

of row ai, and if infuifively represenfs fhe fofal size of A. The specfral norm ||A ||2 = max 2 ;.|| 3 .||=i ||^2;||, 
and represenfs fhe maximum influence along any unif direction x. If follows fhaf — B'^B \\2 = 

max2,.||j.||=i |||Ax|p — ||i?x|p|. 

Given a mafrix A and a low-rank mafrix X lef 7rx(^) = AX^X be a projection operation of A onfo 
the rowspace spanned by X; that is if X is rank r, then it projects to the r-dimensional subspace of points 
(e.g. rows) in X. Here X^ indicates taking the Moore-Penrose pseudoinverse of X. The singular value 
decomposition of A, written svd(A), produces three matrices [U, S, V] so that A = U. Matrix U is 
n X n and orthogonal. Matrix V is d x d and orthogonal; its columns [vi,V 2 , ■ ■ ■, Vd\ are the right singular 
vectors, describing directions of most covariance in A^A. S is n x d and is all Os except for the diagonal 
entries {ai,a 2 , ■ ■ ■ ,(7r}, the singular values, where r < d is the rank. Note that aj > (yj+i, plb = cri, 
and aj = describes the norm along direction Vj. 

Error measures. We consider two classes of error measures between input matrix A and its f x d sketch 
B. The covariance error ensures for any unit vector x G such that ||x|| = 1, that ||Ax|p — ||Hx|p is 
small. This can be mapped to the covariance of A^A since max|| 2 .||=i — B'^B\\ 2 . 

To normalize covariance error we set 

cov-err(A,H) = \\A'^A - B'^B\\2/\\A\\p 

which is always greater than 0 and will typically be less than 0. 150 

The projection error describes the error in the subspace captured by B without focusing on its scale. 
Here we compare the best rank k subspace described by B (as B^) compared against the same for A (as 
A/,). We measure the error by comparing the tail (what remains after projecting A onto this subspace) as 
IIA — 7rsj.(A)|||,. Specifically we normalize projection error so fhaf if is comparable across dafasefs as 

proj-err(A, H) = \\A - ttbAAWf/W^ - M\f- 

*The normalization term |j A|||. is invariant to the desired rank k and the unit vector x. Some methods have bounds on || — 

||Bx||^ that are relative to || A — A*; |p or || Aa;|p; but as these introduce an extra parameter they are harder to measure empirically. 
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Note the denominator is equivalent to ||A — 7r^j,(A)|p so this ensures the projection error is at least 1; 
typically it will be less than 1.5. We set A: = 10 as a default in all experiments 


1.2 Frequency Approximation, Intuition, and Resuits 

There is a strong link between matrix sketching and the frequent items (similar to heavy-hitters) problem 
commonly studied in the streaming literature. That is, given a stream S = {si,S 2 , ■ ■ ■, Sn) of items Sj G 
[rt] = {1, 2,..., u}, represent fj = |{sj G S' | Sj = j}\; the frequency of each item j G [rt]. The goal is to 
construct an estimate fj (for all j G [n]) so that \ fj — fj\ < £n. There are many distinct solutions to this 
problem, and many map to families of matrix sketching paradigms. 

The Misra-Gries |53| (MG) frequent items sketch inspired the FD matrix sketching approach, and the 
extensions discussed herein. The MG sketch uses 0{l/e) space to keep £ — 1 = 1/e counters, each labeled 
by some j G [u]: it increments a counter if the new item matches the associated label or for an empty 
counter, and it decrements all counters if there is no empty counter and none match the stream element, fj 
is the associated counter value for y, or 0 if there is no associated counter. There are other variants of this 
algorithm |26 46 ^ that have slightly different properties |[8||^ that we will describe in Section 3.1 to 
inspire variants of FD. 

A folklore approach towards the frequent items problem is to sample i = 0((l/e^) log 1/5) items from 
the stream. Then fj is the count of the sample scaled hy n/I. This achieves our guarantees with probability 
at least 1 — 5 611, and will correspond with sampling algorithms for sketching which sample rows of 

the matrix. 

Another frequent items sketch is called the count-sketch p7| . It maintains and averages the results of 
0(log 1/5) of the following sketch. Each j G [tt] is randomly hashed /i : [rt] —>• [£] to a cell of a table of size 
i = 0(l/e^); it is added or subtracted from the row based on another random hash s : [u] —)• {—1,+1}. 
Estimating fj is the count at cell h{j) times s{j). This again achieves the desired bounds with probability at 
least 1 — 5. This work inspires the hashing approaches to matrix sketching. If each element j had its effect 
spread out over all 0(log 1/5) instances of the base sketch, it would then map to the random projection 
approaches to sketching, although this comparison is a bit more tenuous. 

Einally, there is another popular frequent items sketch called the count-min sketch | |2^ . It is similar to 
the count-sketch but without the sign hash, and as far as we know has not been directly adapted to a matrix 
sketching algorithm. 


1.3 Contributions 

We survey and categorize the main approaches to matrix sketching in a row-wise update stream. We consider 
three categories: sampling, projection/hashing, and iterative and show how all approaches fit simply into one 
of these three categories. We also provide an extensive set of experiments to compare these algorithms along 
size, error (projection and covariance), and runtime on real and synthetic data. 

To make this study easily and readily reproducible, v/e implement all experiments on a new extension of 
Emulab Q called Adaptable Profile-Driven Testbed or APT Q. It allows one to check out a virtual machine 
with the same specs as we run our experiments, load our precise environments and code and data sets, and 
directly reproduce all experiments. 

We also consider new variants of these approaches which maintain error guarantees but significantly 
improving performance. We introduce several new variants of FD, one of which a-FD matches or ex- 

^There are variations in bounds on this sort of error. Some measure spectral (e.g. || • II 2 ) norm. Others provide a weaker error 
bound on \\A — [rrsfA)]*;|||-, where the “best rank k approximation,” denoted by [jfc, is taken after the projection. This is less 
useful since then for a very large rank B (might be rank 500) it is not clear which subspace best approximates A until this projection 
is performed. Additionally, some approaches create a set of (usually three) matrices (e.g. CUR) with product equal to ttb), [A), 
instead of just B. This is a stronger result, but it does not hold for many approaches, so we omit consideration. 
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ceeds the performance of a popular heuristic iSVD. Before this new variant, iSVD is a top performer in 
space/error trade-off, but has no guarantees, and as we demonstrate on some adversarial data sets, can fail 
spectacularly. We also show how to efficiently implement and analyze without-replacement row sampling 
for matrix sketching, and how this can empirically improve upon more traditional (and easier to analyze) 
with-replacement row sampling. 


2 Matrix Sketching Algorithms 

In this section we review the main algorithms for sketching matrices. We divide them into 3 main categories: 
(1) sampling algorithms, these select a subset of rows from A to use as the sketch (2) projection algo¬ 
rithms, these project the n rows of A onto £ rows of B, sometimes using hashing; (3) incremental algorithms, 
these maintain B as a low-rank version of A updated as more rows are added. 

There exist other forms of matrix approximation, for instance, using sparsification techniques l[7 12 361, 
or allowing more general element-wise updates at the expense of larger sketch sizes | fT^|20l[5^ . We are 
interested in preserving the right singular vectors and other statistical properties on the rows of A. 


2.1 Sampling Matrix Sketching 



£ 

cov-err 

£ 

proj-err 

runtime 

Norm Sampling 

Leverage Sampling 
Deterministic Leverage 

dje^ 

djs^ 

£ 

f(t) 

£(t) 


kje^ 

{k log k)/£^ 
(k/r/e) !/''(*) 


132 


nnz(A) • £ 

svd(A) -f nnz(A) • £ 
svd(A) + nnz(A) ■ £\og£ 

- ' -p-A 

1 “h £ 1 
1 “h £ 1 

A 

51; 

56; 


Table 1: Theoretical Bounds for Sampling Algorithms. The proj-err bounds are based on a slightly weaker 
11^ — III’ numerator instead of || A — vr^j, (A)||| one where we first enforce B^ is rank k. (f) See Ap¬ 

pendix]^ the Leverage Sampling bound assumes a constant lower bound on leverage scores. (*) Maximum 
of this and {k, where leverage scores follow power-law with decay exponent 1 + rj. 


Sampling algorithms assign a probability pi for each row Oj and then selecting i rows from A into B using 
this probability. In B, each row has its squared norm rescaled to Wi as a function of pi and ||ai||. One can 
achieve additive error bound using importance sampling with pj = ||aj|p/||A||| and Wi = ||oi|p/(^pi) = 
||A|||,/4 as analyzed by Drineas et al. |31| and |38|. These algorithms typically advocate sampling I 
items independently (with replacement) using £ distinct reservoir samplers, taking 0{£) time per element. 
Another version | |^ samples each row independently, and only retains £ rows in expectation. We discuss 
two improvements to this process in Section [3^ 

Much of the related literature describes selecting columns instead of rows (called the column subset 
selection problem) 1151. This is just a transpose of the data and has no real difference from what is described 
here. There are also techniques | [35| that select both columns and rows, but are orthogonal to our goals. 

This family of techniques has the advantage that the resulting sketch is interpretable in that each row of 
B corresponds to data point in A, not just a linear combination of them. 


Leverage Sampling. An insightful adaptation changes the probability pi using leverage scores p4| or 
simplex volume p7p8| . These techniques take into account more of the structure of the problem than simply 
the rows norm, and can achieve stronger relative error bounds. But they also require an extra parameter k as 
part of the algorithm, and for the most part require much more work to generate these modified pi scores. 
We use Leverage Sampling |351 as a representative; it samples rows according to leverage scores (described 
below). Simplex volume calculations |27 281 were too involved to be practical. There are also recent 
techniques to improve on the theoretical runtime for leverage sampling ||3^ by approximating the desired 
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values Pi, but as the exact approaches do not demonstrate consistent tangible error improvements, we do not 
pursue this complicated theoretical runtime improvement. 

To calculate leverage scores, we first calculate the svd of A (the task we hoped to avoid). Let Uk be 
the matrix of the top k left singular vectors, and let Uk{i) represent the fth row of that matrix. Then the 
leverage score for row f is Sj = ||[/A:(*)|p! the fraction of squared norm of a* along subspace Uk- Then set 
Pi proportional to Si (e.g. pi = Si/k. Note that Si = k). 


Deterministic Leverage Scores. Another option is to deterministically select rows with the highest Si 
values instead of at random. It can be implemented with a simple priority queue of size £. This has been 
applied to using the leverage scores by Papailiopoulos et al. |561, which again first requires calculating the 
svd of A. We refer to this algorithm as Deterministic Leverage Sampling. 


2.2 Projection Matrix Sketching 



£ 

cov-err 

£ 

proj-err 

runtime 

Random Projection 

dje^ 

59 ] 


elp{A) 

d/e^ 

59] 


1 + £ 

nnz(A )•£ 

Fast JLT 

dje^ 



elp{A) 

dje^ 



1 + £ 

ndlogd + {d/e^) logn ||^ 

Flashing 

d^je^ 1; 


4 ; 


e/p{A) 

(P L 


4 ; 

1 + £ 

nnz(A) -F npoly(d/£) 

OSNAP 

^l+o(s/eJ- 


54 


e/p{A) 


54; 

1 + £ 

nnz(A) • s -F npolyjd/s) 


Table 2: Theoretical Bounds for Projection Algorithms (via an £2 subspace embedding; see Appendix |C.21). 

11^112 -' 

Where £ is the number of rows maintained, and p(A) = | ^ |f > 1 is the numeric rank of A. 


These methods linearly project the n rows of A to £ rows of B. A survey by Woodruff |651 (especially 
Section 2.1) gives an excellent account of this area. In the simplest version, each row Oj G A would map to 
a row hj G B with element Sj^i (jth row and Ah column) of a projection matrix S, and each sj^i is a Gaussian 
random variable with 0 mean and s/nJJ standard deviation. That is, B = SA, where S' is £ x n. This follows 


from the celebrated Johnson-Lindenstrauss lemma |45| as first shown by Sarlos |59| and strengthened by 
Clarkson and Woodruff p9|. Gaussian random variables Sj^i can be replaced with (appropriately scaled) 


{—1, 0, +1} or {—1, +1} random variables Q. We call the version with scaled { — 1, +1} random variables 
as Random Projection. 


Fast JLT. Using a sparse projection matrix X would improve the runtime, but these lose guarantees if the 
input is also sparse (if the non-zero elements do not align). This is circumvented by rotating the space with 
a Hadamard matrix Q, which can be applied more efficiently using FFT tricks, despite being dense. More 
precisely, we use three matrices: P is £ x n and has entries with iid 0 with probability I — q and a Gaussian 
random variable with variance £lq with probability q = min{l, 0((log^ n)/d)}. H is n x n and a random 
Hadamard (this requires n to be padded to a power of 2). D is diagonal with random {—1, +1} in each 
diagonal element. And then the projection matrix is 5 = PHD, although algorithmically the matrices are 
applied implicitly. We refer to this algorithm as Fast JLT. Ultimately, the runtime is brought from 0{nd£) 
to 0{ndlogd + {d/e^) logn). The second term in the runtime can be improved with more complicated 
constructions pBl|^ 
of these extensions. 


which we do not pursue here; we point the reader here |621 for a discussion of some 


Sparse Random Projections. Clarkson and Woodruff JlUj analyzed a very sparse projection matrix S, 


conceived of earlier [25 64|; it has exactly 1 non-zero element per column. To generate S, for each column 
choose a random value between 1 and £ to be the non-zero, and then choose a — 1 or -fl for that location. 
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Thus each rows can be processed in time proportional to its number of non-zeros; it is randomly added or 
subtracted from 1 row of B, as a count sketch | [T7] | on rows instead of counts. We refer this as Hashing. 

A slight modification by Nelson and Nguyen | [54| , called OSNAP, stacks s instances of the projection 
matrix S on top of each other. If Hashing used I' rows, then OSNAP uses I = s ■ (! rows (we use s = 4). 


2.3 Iterative Matrix Sketching 



£ 

cov-err 

£ 

proj-err 

runtime 

Frequent Directions 

Iterative SVD 

k + l/£ 

£ 

^11^ [ 

40 

1 

k/e 

£ 

1 -F e 1 

41 

1 

nd£ 

nd£‘^ 



Table 3: Theoretical Bounds for Iterative Algorithms. 

The main structure of these algorithms is presented in Algorithm |2.1| they maintain a sketch of Ajjj, 
the first i rows of A. The sketch of always uses at most i — I rows. On seeing the zth row of A, it is 
appended to Oj] —I and if needed the sketch is reduced to use at most I — \ rows again using 

some ReduceRank procedure. Notationally we use Oj as the jth singular value in S, and a' as the jth 
singular value in S'. 


Algorithm 2.1 (Generic) FD Algorithm 
Input: G (0,1],A G 

Hjo] ^ all zeros matrix G 

for i G [n] do 

Insert ai into a zero valued rows of result is Hjj] 

if (H[j] has no zero valued rows) then 
[[/,5,H] ^svd(H[,]) 

C[j] = SV'^ # Only needed for proof notation 

S' ^ ReduceRank(S') 

^ S'V^ 

return B = B^^] 


The simplest variant of this procedure is a heuristic rediscovered several times | T^43l44| 
with a few minor modifications, and we refer to as iterative SVD or iSVD. Here ReduceRank(S', V) 
simply keeps a'j = aj for y < I and sets a'f = 0. This has no worst case guarantees (despite several claims). 


Iterative SVD. 

HEZ 


Frequent Directions. Recently Liberty | |50| proposed an algorithm called Frequent Directions (or FD), 
further analyzed by Ghashami and Phillips pT| , and then together jointly with Woodruff | [4()| . The Reduc¬ 
eRank step sets each a'j = .^cr| — 5i where 5i = af. 

Liberty also presented a faster variant FastFD, that instead sets 5i = (Ihe (^/2)th squared singular 
value of H[q) and updates new singular values to a'j = max{0, — <5*}, hence ensuring at most half of 

the rows are all zeros after each such step. This reduces the runtime from 0{nd£‘^) to 0{ndi) at expense of 
a sketch sometimes only using half of its rows. 
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3 New Matrix Sketching Algorithms 

Here we describe our new variants on FD that perform better in practice and are backed with error guaran¬ 
tees. In addition, we explain a couple of new matrix sketching techniques that makes subtle but tangible 
improvements to the other state-of-the-art algorithms mentioned above. 

3.1 New Variants On FrequentDirections 



£ 

cov-err 

£ 

proj-err 

runtime 

Fast a-FD 

{k + ^le)la 

^ WM'i 

kj {eoL) 

1 + £ 

nd£/a 

SpaceSaving Directions 

k -F 1/e 

JA-A^\\l 

kje 

1 + £ 

nd£ 

Compensative FD 

k + l/e 

J\A-A^\\% 

k/e 

1+5 

nd£ 


Table 4: Theoretical Bounds for New Iterative Algorithms. 

Since all our proposed algorithms on FrequentDirections share the same structure, to avoid repeating 
the proof steps, we abstract out three properties that these algorithms follow and prove that any algorithm 
with these properties satisfy the desired error bounds. This slightly generalizes (allowing for a 1) a recent 
framework | [40| . We prove these generalizations in Appendix [A| 

Consider any algorithm that takes an input matrix A G and outputs a matrix B G which 

follows three proprties below (for some parameter a G (0,1] and some value A > 0): 

• Property 1: For any unit vector x we have ||Ax|p — ||Hx|p > 0. 

• Property 2: For any unit vector x we have || Ax|p — ||Hx|p < A. 

• Property 3: ||A|||, — ||H|||, > aA£. 

Lemma 3.1. Any B satisfying the above three properties satisfies 

0 < IIA^A - B^Bh < - AkWl, 

al — k 

cy P 

and \\A- ^ P " 

where (•) represents the projection operator onto B^, the top k singular vectors of B. 

Thus setting I = k -\- 1/e achieves ||A^A — B"^B \\2 < £||A — Afc||p, and setting £ = k + k/e achieves 

||A - -KBkiAfW < (1 + e)\\A - AkWp. FD maintains an £ X d matrix B (i.e. using 0{£d) space), and it is 

shown pTI that there exists a value A that FD satisfies three above-mentioned properties with a = 1. 

Parameterized FD. Parameterized FD uses the following subroutine (Algorithm |3.1| ) to reduce the rank 
of the sketch; it zeros out row £. This method has an extra parameter a G [0,1] that describes the fraction of 
singular values which will get affected in the ReduceRank subroutine. Note iSVD has a = 0 and FD has 
a = 1. The intuition is that the smaller singular values are more likely associated with noise terms and the 
larger ones with signals, so we should avoid altering the signal terms in the ReduceRank step. 

Here we show error bounds asymptotically matching FD for a-FD (for constant a > 0), by showing the 
three Properties hold. We use A = 5i. 

Lemma 3.2. For any unit vector x and any a > 0.' 0 < [jC'^jplP — < 6i. 
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Algorithm 3.1 ReduceRank-PFD(5', a) 

Si •<— (jf 


return diag(cJi,..., fT£(i_o), -Si,... 

,^Jaj-Si) 


Proof. The right hand side is shown by just expanding HCijplP — ||i?[j]x|p. 

Ill 

\\C[i]xf - WB^xf = '^a]{vj,xf -'^a']{vj,xf = ^(ct| - a'])(vj,xf 

j=i j=i j=i 

l 

= 6i ^ {vj,x)'^ < SiWxW"^ = Si 

j=il-a)l+l 

To see the left side of the inequality Si — 0- 

Then summing over all steps of the algorithm (using ||aix|p = — ||-B[j_i]x|p) it follows (see 

Lemma 2.3 in ||4T|) that 

n 

0 < pxf - \\Bxf < = A, 

i=l 

proving Property 1 and Property 2 about a-FD for any a G [0,1]. 

Lemma 3.3. For any a G (0,1], || A|||, — ||i?|||' = aAf', proving Property 3. 

Proof. We expand that ||C[j] = Y^^j=i 'x] to get 


{l—a)l I 

\\C[i\\\F= ^ 




i=i 

(l — a)l 


j = (l-Q)l+l 


— j + j + Si) — ||i3[j]||p + aiSi. 

j = l j = (l-Ol)l+l 

By using ||aj|p = ||C'[j]|||, - = {\\B\^\\], + a(.Si) - \\B[i_i]\\'j,, and summing over i we get 

n n 

= ~ ll-®[i-i]llF + OilSi = ||f?|||’ + alls.. 


□ 


1=1 i=l 

Subtracting ||i?|||' from both sides, completes the proof. 

The combination of the three Properties, provides the following results. 

Theorem 3.1. Given an input matrix A G a-FD with parameter £ returns a sketch B G that 

satisfies for all k > a£ 

0 < Pxf - \\Bxf < P - Akfp/iai - k) 
and projection of A onto B^, the top k rows of B satisfies 

cy P 

\\A-FBM)fF<^fl^^U-A,\\l. 

Setting £= {k + l/e)/a yields 0 < ||^x|p — ||i?x|p < e||^ — and setting £= {k + kje) ja yields 
P-vrs, (21)111 <(l + e)||2l-2l,|||. 









Fast Parameterized FD. Fast Parameterized FD(or Fast a-FD) improves the runtime performance of 
parameterized FD in the same way Fast FD improves the performance of FD. More specifically, in Re- 
DUCeRank we set di as the {£ — ^a/2)th squared singular value, i.e. 5i = for t = £ — £al‘l. 

Then we update the sketch by only changing the last al singular values: we set = max(f7| — 5* *, 0). 

This sets at least a;£/2 singular values to 0 once every a£/2 steps. Thus the algorithm takes total time 
0{nd + njialj^) ■ d£‘^) = 0{nd£ja). 

It is easy to see that Fast a-FD inherits the same worst case bounds as a-FD on cov-err and proj-err, if we 
use twice as many rows. That is, setting £ = 2{k + l/s)/a yields — B'^B \\2 < £\\A — and 

setting £ = 2{k + kle)la yields ||A — 'KBk{A)\\\ < (1 + ^)ll^ “ In experiments we consider Fast 

0.2-FD. 


SpaceSaving Directions. Motivated by an empirical study p2| showing that the SpaceSaving algo¬ 
rithm | [52] | tends to outperform its analog Misra-Gries |53| in practice, we design an algorithm called 
SpaceSaving Directions (abbreviated SSD) to try to extend these ideas to matrix sketching. It uses 
Algorithm |3.2| for ReduceRank. Like the SS algorithm for frequent items, it assigns the counts for the 
second smallest counter (in this case squared singular value to the direction of the smallest. Unlike the 
SS algorithm, we do not use as the squared norm along each direction orthogonal to B, as that gives a 
consistent over-estimate. 


Algorithm 3.2 ReduceRank-SS(5) 

5i <y‘f_i 

return diag(cri,..., fj£_ 2 ,0, Ucr| + <5*). 


We can also show similar error bounds for SSD. It shows that a simple transformation of the output sketch 
B ^ SSD (A) satisfies the three Properties, although B itself does not. We defer these proofs to Appendix 
[Bj just stating the main bounds here. 

Theorem 3.2. After obtaining a matrix B from SSD on a matrix A with parameter £, the following proper¬ 
ties hold: 

• \\A\\l = \\B\\%. 

• for any unit vector X and for k < £/2 —1/2, we have 11| Axp —||i?x|p| < \\A—Ai^\\‘jp/{£/2 — l/2 — k). 

• for k < £/2 — 1 we have ||A — 7r^(A)|||, < ||A — Ak\\‘j?{£ — l)/{£ — 1 — 2k). 

Setting £ = 2k -\- 2/e + 1 yields 0 < ||Ax|p — < e||A — and setting f = 2fc + 1 + 2k/£ 

yields ||A - 7rsj^(A)|||, < (1 + e)||A - Afc|||. 


Compensative Frequent Directions. Inspired by the isomorphic transformation 1(8| between the Misra- 
Gries | |53| and the SpaceSaving sketch | |52| , which performed better in practice on the frequent items prob¬ 
lem 1221, we consider another variant of FD for matrix sketching. In the frequent items problem, this would 
return an identical result to SSD, but in the matrix setting it does not. 

We call this approach COMPENSATIVE FrequentDirections (abbreviated CFD). In original FD, the 
computed sketch B underestimates the Frobenius norm of stream | |4T| , in CFD we try to compensate for 
this. Specifically, we keep track of the total mass A = subtracted from squared singular values 

(this requires only an extra counter). Then we slightly modify the FD algorithm. In the final step where 


B = S'V'^, we modify S' to S by setting each singular value bj = y + A, then we instead return 
B = SV^. 
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It now follows that for any k < i, including k = 0, that ||A|||, = that for any unit vector x 

we have |||74x|||. — ||-Bx|||.| < A < ||^ — Ak\\\/{i — k) for any k < i, and since V is unchanged that 
11^ — Tr^{A)\\‘jp < \\A — Ak\\‘j^ ■ £/{i — k). Also as in FD, setting £ = k + l/s yields 0 < ||Ax|p — ||-Bx|p < 
e\\A - Ak\\p and setting £ = fc/e yields \\A - 7rBj^(A)|||, < (1 + e)||A - Ak\\p. 

3.2 New Without Replacement Sampling Algorithms 



£ 

cov-err 

£ 

proj-err 

runtime 

Priority 

dje^ 

£ 

£ 

- 

nnz(A) \og£ 

VarOpt 

dje^ 

£ 

£ 

- 

nnz(A) \og£ 


Table 5: Theoretical Bounds for New Sampling Algorithms. 


As mentioned above, most sampling algorithms use sampling with replacement (SwR) of rows. This is 
likely because, in contrast to sampling without replacement (SwoR), it is easy to analyze and for weighted 
samples conceptually easy to compute. SwoR for unweighted data can easily be done with variants of 
reservoir sampling ||63| ; however, variants for weighted data have been much less resolved until recently 

wn. 


Priority Sampiing. A simple technique | [T7| for SwoR on weighted elements first assigns each element i 
a random number Ui G Unif(0, 1). This implies a priority pi = Wijui, based on its weight Wi (which for 
matrix rows Wi = ||a||f). We then simply retain the £ rows with largest priorities, using a priority queue 
of size £. Thus each step takes 0{log£) time, but on randomly ordered data would take only 0(1) time in 
expectation since elements with pi < r, where r is the ^th largest priority seen so far, are discarded. 

Retained rows are given a squared norm Wi = max(t(;j, r). Rows with rcj > r are always retained with 
original norm. Small weighted rows are kept proportional to their squared norms. The technique. Priority 
Sampiing, is simple to implement, but requires a second pass on retained rows to assign final weights. 


VarOpt Sampling. VarOpt (or Variance Optimal) sampling is a modification of priority sampling that 
takes more care in selecting the threshold r. In priority sampling, r is generated so wi] = || A|||., 

but if T is set more carefully, then we can achieve Yla &B ~ II^IIf deterministically. VarOpt selects each 
row with some probability pi = min(l, Wijr), with Wi = max{wi, r), and so exactly £ rows are selected. 

The above implies that for a set L of £ rows maintained, there is a fixed threshold r that creates the 
equality. We maintain this value r as well as the t weights smaller than r inductively in L. If we have seen 
at least £ + 1 items in the stream, there must be at least one weight less than r. On seeing a new item, we 
use the stored priorities pi = Wi/Ui for each item in L to either (a) discard the new item, or (b) keep it and 
drop another item from the reservoir. As the priorities increase, the threshold r must always increase. It 
takes amortized constant time to discarding a new item or 0{\og£) time to keep the new item, and does not 
require a final pass on L. We refer to it as VarOpt. 

A similar algorithm using priority sampling was considered in a distributed streaming setting p9| , which 
provided a high probability bound on cov-err. A constant probability of failure bound for £ = 0{d/e^) 
and cov-err < e, follows with minor modification from Section |C.1| It is an open question to bound the 


projection error for these algorithms, but we conjecture the bounds will match those of Norm Sampling. 


4 Experimental Setup 

We used an OpenSUSE 12.3 machine with 32 cores of Intel(R) Core(TM) i7-4770S CPU(3.10 GHz) and 
32GB of RAM. Randomized algorithms were run five times; we report the median error value. 
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DataSet 

# datapoints 

# attributes 

rank 

numeric rank 

nnz% 

excess kurtosis 

Birds 

11789 

312 

312 

12.50 

100 

1.72 

Random Noisy 

10000 

500 

500 

14.93 

100 

0.95 

CIFAR-10 

60000 

3072 

3072 

1.19 

99.75 

1.34 

ConnectUS 

394792 

512 

512 

4.83 

0.0055 

17.60 

Spam 

9324 

499 

499 

3.25 

0.07 

3.79 

Adversarial 

10000 

500 

500 

1.69 

100 

5.80 


Table 6: Dataset Statistics. 
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(a) Birds 

(b) Random Noisy 

(c) CIFAR-10 



(d) ConnectUS 


(e) Spam (f) Adversarial 


Figure 1: Singular values distribution for datasets in Table The x-axis is singular value index, and the 
y-axis shows normalized singular values such that the highest singular value is one, i.e., each value divided 
by largest singular value of dataset 


Datasets. We compare performance of the algorithms on both synthetic and real datasets. In addition, we 
generate adversarial data to show that iSVD performs poorly under specific circumstances, this explains why 
there is no theoretical guarantee for them. Each data set is an n x d matrix A, and the n rows are processed 
one-by-one in a stream. 

Table|^lists all datasets with information about their n, d, rank(A), numeric rank ||74|||,/||A||2, percent¬ 
age of non-zeros (as nnz%, measuring sparsity), and excess kurtosis. We follow Fisher’s distribution with 
baseline kurtosis (from normal distribution) is 0; positive excess kurtosis reflects fatter tails and negative 
excess kurtosis represents thinner tails. 

For Random Noisy, we generate the input n x d matrix A synthetically, mimicking the approach by 
Fiberty |50|. We compose A = SDU + F/(, where SDU is the m-dimensional signal (for m < d) and 
F/C, is the (full) d-dimensional noise with controlling the signal to noise ratio. Each entry Fjj of F is 
generated i.i.d. from a normal distribution A^(0,1), and we set Q = 10. For the signal, S G again we 

generate each Sij ~ iV(0,1) i.i.d; D is diagonal with entries Di^i = 1 — (f — l)/d linearly decreasing; and 
U G is just a random rotation. We use n = 10000, d = 500, and consider m G {10, 20,30, 50} with 
m = 30 as default. 

In order to create Adversarial data, we constructed two orthogonal subspaces Si = and S 2 = 

(mi = 400 and m 2 = 4). Then we picked two separate sets of random vectors Y and Z and projected 
them on Si and S 2 , respectively. Normalizing the projected vectors and concatenating them gives us the 
input matrix A. All vectors in 7r5^(y) appear in the stream before 7rs2iZ)-, this represents a very sudden 
and orthogonal shift. As the theorems predict, FD and our proposed algorithms adjust to this change and 
properly compensate for it. However, since mi > £, then iSVD cannot adjust and always discards all new 
rows in S 2 since they always represent the smallest singular value of . 

We consider 4 real-world datasets. ConnectUS is taken from the University of Florida Sparse Matrix 
collection Q. ConnectUS represents a recommendation system. Each column is a user, and each row is 
a webpage, tagged 1 if favorable, 0 otherwise. It contains 171 users that share no webpages preferences 
with any other users. Birds Q has each row represent an image of a bird, and each column a feature. 
PCA is a common first approach in analyzing this data, so we center the matrix. Spam Q has each row 
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represent a spam message, and each column some feature; it has dramatic and abrupt feature drift over the 
stream, but not as much as Adversarial. CIFAR-10 is a standard computer vision benchmark dataset for deep 
learning | |47| . 

The singular values distribution of the datasets is given in Figure [T] The x-axis is the singular value 
index, and the y-axis shows the normalized singular values, i.e. singular values divided by ui, where ai 
is the largest singular value of dataset. Birds, ConnectUS and Spam have consistent drop-offs in singular 
values. Random Noisy has initial sharp and consistent drops in singular values, and then a more gradual 
decrease. The drop-offs in CIFAR-10 and Adversarial are more dramatic. 

We will focus most of our experiments on three data sets Birds (dense, tall, large numeric rank). Spam 
(sparse, not tall, negative kurtosis, high numeric rank), and Random Noisy (dense, tall, synthetic). How¬ 
ever, for some distinctions between algorithms require considering much larger datasets; for these we use 
CIFAR-10 (dense, not as tall, small numeric rank) and ConnectUS (sparse, tall, medium numeric rank). 
Finally, Adversarial and, perhaps surprisingly ConnectUS are used to show that using iSVD (which has no 
guarantees) does not always perform well. 


5 Experimental Evaluation 

We divide our experimental evaluation into four sections: The first three sections contain comparisons within 
algorithms of each group (sampling, projection, and iterative), while the fourth compares accuracy and run 
time of exemplar algorithm in each group against each other. 

We measure error for all algorithms as we change the parameter i (Sketch Size) determining the number 
of rows in matrix B. We measure covariance error as err = WA"’" A — B\\ 2 /\\A\\‘j^ (Covariance Error); this 

indicates for instance for FD, that err should be at most 1/i, but could be dramatically less if || A — is 
much less than ||^|||' for some not so large k. We consider proj-err = ||A — 7rBj,(^)|||./||^ — Afclll’. always 
using fe = 10 (Projection Error); for FD we should have proj-err < il{£ — 10), and > 1 in general. We also 
measure run-time as sketch size varies. 

Within each class, the algorithms are not dramatically different across sketch sizes. But across classes, 
they vary in other ways, and so in the global comparison, we will also show plots comparing runtime to 
cov-err or proj-err, which will help demonstrate and compare these trade-offs. 

5.1 Sampling Algorithms 

Figure]^ shows the covariance error, projection error, and runtime for the sampling algorithms as a function 
of sketch size, run on the Birds, Spam, and Random Noisy(30) datasets with sketch sizes from £ = 20 to 
100. We use parameter k = 10 for Leverage Sampling, the same k used to evaluate proj-err. 

First note that Deterministic Leverage performs quite differently than all other algorithms. The error rates 
can be drastically different: smaller on Random Noisy proj-err and Birds proj-err, while higher on Spam proj- 
err and all cov-err plots. The proven guarantees are only for matrices with Zipfian leverage score sequences 
and proj-err, and so when this does not hold it can perform worse. But when the conditions are right it 
outperforms the randomized algorithms since it deterministically chooses the best rows. 

Otherwise, there is very small difference between the error performance of all randomized algorithms, 
within random variation. The small difference is perhaps surprising since Leverage Sampling has a stronger 
error guarantee, achieving a relative proj-err bound instead of an additive error of Norm Sampling, Priority 
Sampling and VarOpt Sampling which only use the row norms. Moreover Leverage Sampling and Determin¬ 
istic Leverage Sampling are significantly slower than the other approaches since they require first computing 
the SVD and leverage scores. We note that if ||^ — > c||A|||. for a large enough constant c, then for 

that choice of k, the tail is effectively fat, and thus not much is gained by the relative error bounds. Moreover, 
Leverage Sampling bounds are only stronger than Norm Sampling in a variant of proj-err where [7rs(^)]fc 
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Figure 2: Sampling algorithms on Birds(left), Spam(middle), and Random Noisy(30)(right). 


(with best rank k applied after projection) instead of 'Kb^{A), and cov-err bounds are only known (see Ap¬ 
pendix [CT]) under some restrictions for Leverage Sampling, while unrestricted for the other randomized 
sampling algorithms. 


5.2 Projection Algorithms 

Figure plots the covariance and projection error, as well as the runtime for various sketch sizes of 20 to 
100 for the projection algorithms. 

Otherwise, there were two clear classes of algorithms. For the same sketch size. Hashing and OSNAP 
perform a bit worse on projection error (most clearly on Noisy Random), and roughly the same in covariance 
error, compared to Random Projections and Fast JLT. Note that Fast JLT seems consistently better than 
others in cov-err, but we have chosen the best q parameter (sampling rate) by trial and error, so this may 
give an unfair advantage. Moreover, Hashing and OSNAP also have significantly faster runtime, especially 
as the sketch size grows. While Random Projections and Fast JLT appear to grow in time roughly linearly 
with sketch size. Hashing and OSNAP are basically constant. Section [5^ on larger datasets and sketch sizes, 
shows that if the size of the sketch is not as important as runtime. Hashing and OSNAP have the advantage. 
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Figure 3: Projection algorithms on Birds(left), Spam(middle), and Random Noisy(30)(right). 


5.3 Iterative Algorithms 

Here we consider variants of FD. We first explore the a parameter in Parametrized FD, writing each version 
as a-FD. Then we compare against all of the other variants using explores from Parametrized FD. 

In Figure|^and[^ we explore the effect of the parameter a, and run variants with a G {0.2, 0.4,0.6,0.8}, 
comparing against FD {a = 1) and iSVD (a = 0). Note that the guaranteed error gets worse for smaller a, 
so performance being equal, it is preferable to have larger a. Yet, we observe empirically on datasets Birds, 
Spam, and Random Noisy that FD is consistently the worst algorithm, and iSVD is fairly consistently the 
best, and as a decreases, the observed error improves. The difference can be quite dramatic; for instance 
in the Spam dataset, for I = 20, FD has err = 0.032 while iSVD and 0.2-FD have err = 0.008. Yet, as 
i approaches 100, all algorithms seems to be approaching the same small error. In Figure we explore 
the effect of a-FD on Random Noisy data by varying m G {10,20,50}, and m = 30 in Figure]^ We 
observe that all algorithms get smaller error for smaller m (there are fewer “directions” to approximate), but 
that each a-FD variant reaches 0.005 err before t = 100, sooner for smaller a; eventually “snapping” to a 
smaller 0.002 err level. 

Next in Figure we compare iSVD, FD, and 0.2-FD with two groups of variants: one based on SS 
streaming algorithm (CFD and SSD) and another based on Fast FD. We see that CFD and SSD typically 
perform slightly better than FD in cov-err and same or worse in proj-err, but not nearly as good as 0.2- 
FD and iSVD. Perhaps it is surprising that although SpaceSavings variants empirically improve upon MG 
variants for frequent items, 0.2-FD (based on MG) can largely outperform all the SS variants on matrix 
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Figure 4: Parametrized FD on Birds (left), and Spam (middle), Random Noisy(30) (right). 




Figure 5: Parametrized FD on Random Noisy for m = 50 (left), 20 (middle), 10 (right), 
sketching. 

All variants achieve a very small error, but 0.2-FD, iSVD, and Fast 0.2-FD consistently matches or outper¬ 
forms others in both cov-err and proj-err while Fast FD incurs more error compare to other algorithms. We 
also observe that Fast FD and Fast 0.2-FD are significantly (sometimes 10 times) faster than FD, iSVD, and 
0.2-FD. Fast FD takes less time, sometimes half as much compared to Fast 0.2-FD, however, given its much 
smaller error Fast 0.2-FD seems to have the best all-around performance. 

Data adversarial to ISVD. Next, using the Adversariai construction we show that iSVD is not always 
better in practice. In Figure]^ we see that iSVD can perform much worse than other techniques. Although 
at £ = 20, iSVD and FD roughly perform the same (with about err = 0.09), iSVD does not improve much 
as I increases, obtaining only err = 0.08 for I = 100. On the other hand, FD (as well as CFD and SSD) 
decrease markedly and consistently to err = 0.02 for I = 100. Moreover, all version of a-FD obtain roughly 
err=0.005 already for ^ = 20. The large-norm directions are the first 4 singular vectors (from the second 
part of the stream) and once these directions are recognized as having the largest singular vectors, they are 
no longer decremented in any Parametrized FD algorithm. 

To wrap up this section, we demonstrate the scalability of these approaches on a much larger real data 
set ConnectUS. Figurej^ shows variants of Parameterized FD, and other iterative variants on this dataset. 
As the derived bounds on covariance error based on sketch size do not depend on n, the number of rows 
in A, it is not surprising that the performance of most algorithms is unchanged. There are just a couple 
differences to point out. First, no algorithm converges as close to 0 error as with the other smaller data sets; 
this is likely because with the much larger size, there is some variation that can not be captured even with 
I = 100 rows of a sketch. Second, iSVD performs noticeably worse than the other FD-based algorithms 
(although still significantly better than the leading randomized algorithms). This likely has to do with the 
sparsity of ConnectUS combined with a data drift. After building up a sketch on the first part of the matrix, 
sparse rows are observed orthogonal to existing directions. The orthogonality, the same difficult property as 
in Adversarial, likely occurs here because the new rows have a small number of non-zero entrees, and all 
rows in the sketch have zeros in these locations; these correspond to the webpages marked by one of the 


15 















Birds 


Spam 


Random Noisy 




Figure 6: Iterative algorithms on Birds(left), Spam(middle), and Random Noisy(30)(right). 




Figure 7: Demonstrating dangers of iSVD on Adversarial data. 

unconnected users. 

5.4 Global Comparison 

Figure shows the covariance error, projection error, as well as the runtime for various sketch sizes of 
^ = 20 to 100 for the the leading algorithms from each category. 

We can observe that the iterative algorithms achieve much smaller errors, both covariance and projection, 
than all other algorithms, sometimes matched by Deterministic Leverage. However, they are also signifi¬ 
cantly slower (sometimes a factor of 20 or more) than the other algorithms. The exception is Fast FD and 
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Figure 8: Parameterized FD (left), other iterative (right) algorithms on ConnectUS dataset. 











Figure 9: Leading algorithms on Birds(left), Spam(middle), and Random Noisy(30)(right). 


Fast 0.2-FD, which are slower than the other algorithms, but not significantly so. 

We also observe that for the most part, there is a negligible difference in the performance between the 
sampling algorithms and the projection algorithms, except for the Random Noisy dataset where Hashing 
and OSNAP result in worse projection error. 

However, if we allow a much large sketch size for faster runtime and small error, then these plots do not 
effectively demonstrate which algorithm performs best. Thus in Figure[T^we run the leading algorithms on 
Birds as well as larger datasets, ConnectUS which is sparse and CIFAR-10 which is dense. We plot the error 
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Figure 10: Projection error versus time on Birds and ConnectUS as well as Covariance error versus time 
on CIFAR-10. The second line shows close-ups 


versus the runtime for various sketch sizes ranging up to £ = 10,000. The top row of the plots shows most 
data points to give a holistic view, and the second row zooms in on the relevant portion. 

For some plots, we draw an Error Threshold vertical line corresponding to the error achieved by Fast 
0 .2-FD using £ = 20. Since this error is typically very low, but in comparison to the sampling or projection 
algorithms Fast 0.2-FD is slow, this threshold is a useful target error rate for the other leading algorithms. 

We observe that Fast FD can sometimes match this error with slightly less time (see on Birds), but requires 
a larger sketch size of £ = 100. Additionally VarOpt, Priority Sampiing, Hashing, and OSNAP can often 
meet this threshold. Their runtimes can be roughly 100 to 200 times faster, but require sketch sizes on the 
order o^l = 10,000 to match the error of Fast 0.2-FD with I = 20. 

Among these fast algorithms requiring large sketch sizes we observe that VarOpt scales better than Priority 
Sampiing, and that these two perform best on CIFAR-10, the large dense dataset. They also noticeably 
outperform Norm Sampling both in runtime and error for the same sketch size. On the sparse dataset 
ConnectUS, algorithms Hashing and OSNAP seem to dominate Priority Sampling and VarOpt, and of those 
two Hashing performs slightly better. 

To put this space in perspective, on CIFAR-10 (n = 60,000 rows, 1.4GB memory footprint), to approxi¬ 
mately reach the error threshold Hashing needs I = 10,000 and 234MB in 2.4 seconds, VarOpt Sampling 
requires £ = 5,000 and 117MB in 1.2 seconds. Fast FD requires i = 100 and 2.3MB in 130 seconds, and 
Fast 0.2-FD requires ^ = 20 and 0.48MB in 128 seconds. All of these will easily fit in memory of most 
modem machines. The smaller sketch by Fast 0.2-FD will allow expensive downstream applications (such 
as deep learning) to run much faster. Alternatively, the output from VarOpt Sampling (which maintains 
interpretability of original rows) could be fed into Fast 0.2-FD to get a compressed sketch in less time. 

Reproducibility. We provide public access to our results using a testbed facility APT Q. APT is a plat¬ 
form where researchers can perform experiments and keep them public for verification and validation of the 
results. We provide our code, datasets, and experimental results in our APT profile with detailed description 
on how to reproduce, available at: http : / / apt lab . net/p/MatrixApx/MatrixApproxComparision 
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A Generalized Error Bounds for fd 


Lemma A.l. In any such algorithm, for any unit vector x: 

0 < ||Ax|p — < ||A — Ak\\\/{al — k) 
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Proof. In the following, y* correspond to the singular vectors of A ordered with respect to a decreasing 
corresponding singular value order. 


aM< \\A\\l-\\B 


via Property 3 


k d 

i=\ i=k-\-l 

k 

= ^ + ||A — AkW'jp — \\B\\p 

i=l 
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< M ^ {II.49.P - IISkP) 
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— 11^ ~ 3" 
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i=\ 
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^\\By,f<\\B 

i=l 


2 

F 


via Property 2 


Solving aA£ < ||yl — A^W'j^ + kA for A to obtain A < ||^ — Ai~\\‘j^/{ai — k), which combined with 
Property 1 and Property 2 proves the lemma. □ 


Lemma A.2. Any such algorithm described above, satisfies the following error bound 

\\A-FBd^)\\ ^ ai/{a£- k)\\A- AkWp 

Where (') represents the projection operator onto Bk, the top k singular vectors of B. 

Proof Here, y* correspond to the singular vectors of A as above and Vi to the singular vectors of H in a 
similar fashion. 
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Pythagorean theorem 


via Property 1 

i j 

since 

2=1 2=1 

via Property 2 


by A < \\A - AkWp/iai - k) 


This completes the proof of lemma. 


□ 


B Error Bounds for ssd 
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To understand the error bounds for ReduceRank-SS, we will consider an arbitrary unit vector x. We 
can decompose x = X]j=i Pj'^j where /3| = {x, Vj)"^ > 0 and Pj = 1- For notational convenience, 
without loss of generality, we assume that /3j = 0 for j > 1. Thus vi-\ represents the entire component of 
X in the null space of B (or after processing row i). 

To analyze this algorithm, at iteration i >£, we consider a d x d matrix Sjj] that has the following 
properties: fory < f"— 1 andy = £, and = 6i for j = £ — 1 andy > £. This 

matrix provides the constant but bounded overcount similar to the SS sketch. Also let Ajj] = [oi; 02 ;...; Hi]. 

Lemma B.l. For any unit vector x we have 0 < ||.B[j]x|p — < 26i 

Proof. We prove the first inequality by induction on i. It holds for i = — 1, since and 

||-B[j]x|p > We now consider the inductive step at i. Before the reduce-rank call, the property 

holds, since adding row a* to both Ajj] (from A[j_i]) and C[j] (from increases both squared norms 

equally (by (oi, x)^) and the left rotation by U'^ also does not change norms on the right. On the reduce- 
rank, norms only change in directions and vi^i- Direction vi increases by 5i, and in the directions 
Vi-i also does not change, since it is set back to 6i, which it was before the reduce-rank. 

We prove the second inequality also by induction, where it also trivially holds for the base case i = £ — 1. 
Now we consider the inductive step, given it holds for i — 1. First obverse that <5* > 6i-i since 6i is at least 
the {£ — l)st squared singular value of which is at least Thus, the property holds up to the 

reduce rank step, since again, adding row Oj and left-rotating does not affect the difference in norms. After 
the reduce rank, we again only need to consider the two directions changed vi^i and By definition 

||^[i]^£-i|P + 2<5i > 6i = WB^ijVi-iW"^, 

so direction vi-i is satisfied. Then 

ll-^[i]WlP = = 5i + ||C'[j]t>£|p < 25i 

and 0 < llAjjjx^lp < ||.B[j]X£|p. Hence — || Ajjp^lp < 2(5j — 0, satisfying the property for direction 

vi, and completing the proof. □ □ 


Now we would like to prove the three Properties needed for relative error bounds for B = By^. But this 
does not hold since ||i?|||. = ||A|||, (an otherwise nice property), and \\B\\j^ 3> ||A|||.. Instead, we first 
consider yet another matrix B defined as follows with respect to B. B and B have the same right singular 
values V. Let 6 = 6n, and for each singular value aj of B, adjust the corresponding singular values of B to 

be dj = max{0, Bcr'j — 25}. 


Lemma B.2. For any unit vector x we have f) < ||Ax|p — ||.Bx|p < 25 and 


2 _ II R| 
F ll-°l 


> <5(^-1). 


Proof. Directions Vj for 3 > ^-1, the squared singular values are shrunk by at least 5. The squared singular 
value is already 0 for direction X£_i. And the singular value for direction is shrunk by 5 to be exactly 0. 
Since before shrinking ||H|||, = ||A|||,, the second expression in the lemma holds. 

The first expression follows by Lemma 


B.l 


since B only increases the squared singular values in direc- 


■j for y = ^ — 1 and y > by 5, which are 0 in B. And other directions vj are the same for B and B 


tions V 

and are at most 25 larger than in A. 


□ 


Thus B satisfies the three Properties. We can now state the following property about B directly, setting 
a = (1/2), adjusting £\.o £ — 1, then adding back the at most 25 = A < ||A — A}^\\‘^p/{a£ — a — k)\.o each 
directional norm. 


Theorem B.l. After obtaining a matrix B from SSD on a matrix A with parameter £, the following proper¬ 
ties hold: 
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• \\A\\l = \\B\\l. 

• for any unit vector X and for k < £/2 —1/2, we have | ||74a;|p —||i?x|p| < \\A—Ak\\‘jp/{i/2 — l/2 — k). 

• for k < £/2 — 1 we have ||^ — 7r^{A)\\‘p < ||^ — AkWpf — !)/{£ — 1 — 2k). 


C Adapting Known Error Bounds 

Not many algorithms state bounds in terms of covariance error or the projection bounds in the particular 
setting we consider. Here we show using properties of these algorithms they achieve covariance error bound 
too. 


C.1 Column Sampling Algorithms 

Almost all column sampling algorithms have a projection bound in terms of || A—vr^AIII, < /(e)|| A— 
where A G x d is the input matrix and B G is the output sketch. Here we derive the other type 
of bound, cov-err, for the two main algorithms in this regime; i.e. Norm Sampling and Leverage Sampling. 
In our proof, we use a variant of Chernoff-Hoeffding inequality: Consider a set of r independent random 
variables {Xi, • • • , X^} where 0 < Xj < A. Let M = ’^^en for any a G (0,1/2) 

/_2fv2\ 

Pr [\M — E[M]| > a] <2 exp ( j • 


Lemma C.l. Let B G with I = Oidje^) be the output of Norm Sampling. Then with probability 
99/100, for all unit vectors x G 


cov-err{A,B) = 


— ||Ax| 

2 
F 


< S. 


Proof. Consider any unit vector x G Define (. independent random variables Xi = {bi,x)^ for i = 
Recall that Norm Sampling selects row Oj to be row in the sketch matrix B with probability 
Pr(6j •«— aj) = ||ajjp/||A|||,, and rescales the sampled row as ||6i|p = \\aj\\'^/{£Fifbi ^ aj)) = ||A|||,/L 
Knowing these, we bound each A* as 0 < Xi < ||6j|p = ||A|||,/£ therefore Aj = ||A|||,/£ for all AjS. 
Setting M = Yl!i=i observe 


£ £ n 

E[M] = ^E[A,] = ^^Pr(5, 

i=i j=i 


.J. 


i=l 


\ i/£Pr(6i ^ aj) 


,x ) = 




i=l j=l 


Finally using the Chernoff-Hoeffding bound and setting a = e 


yields 


Pr [|||Hx|p - ||Ax|p| > e||A|||.] < 2exp 


Letting the probability of failure for that x be d = 1/100, and solving for £ in the last inequality, we obtain 
£ > ^ ln(2/b) = ^ ln(200). 

However, this only holds for a single direction x; we need this to hold for all unit vectors x. It can be 
shown, that allowing a = 0(e) ■ ||A|||,, we actually only need this to hold for a net T of size t = 


such directions x |651. Then by the union bound, setting S = l/(100t) this will hold for all unit vectors in 
T, and thus (after scaling e by a constant) we can solve for £ = 0((l/e^) ln(f)) = 0(d/e^). Hence with 


probability at least 99/100 we have cov-err(A, B) = 


\\\Bx\\^-\\Axf\ 


IIAIl 


< e for all unit vectors x. 


□ 
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We would like to apply a similar proof for Leverage Sampling, but we do not obtain a good bound for A 
in the Chernoff-Hoeffding bound. We rescale norm of each selected row bi as 


\\hf = 


C[<7 


1 


Sk 


i Fr{bi 


(fc) ’ 


where is the rank-A; leverage score of aj, and Sk = Unfortunately, can be arbitrarily 

small compared to Sk (e.g., if aj lies almost entirely outside the best rank-A: subspace). And thus we do not 

(k) 

have a finite bound on A. As such we assume that Sk/sj ' < /3 for an absolute constant /3, and then obtain 
a bound based on (3. 

Lemma C.2. Let B G with i = OidfP' je^) be the output of Leverage Sampling. Under the assump- 

(k) 

tion that rank-k leverage score of row aj is bounded as Sj ' > fiSkfor a fixed constant P > 0, then with 
probability 99/100, for all unit vectors x G 


(fc) _ 




cov-err= |||ilx|p — ||Ax| 




Proof Similar to Lemma C.l we define i random variables Xi = {bi,x)^ for f = 1, 
Sampling algorithm selects row aj to be row hi with probability Pr(6j •(— aj) = s 


(k) 


■ ■ ■ , Leverage 

/ Sk and rescales 


sampled rows as ||6i|p = ||aj|p/(£Pr(6i ^ aj)) = {WajW^/tj{Sk/sf"’) < ||ojp/(/3^) < ||A|||,/(/3£). 
Therefore Aj = ||A|||,/(/3.^) for all AjS. Setting M = — 11-®^IP observe: 

J V ■' 

ttj) ( -— , X ) = 


^ £ 72 

E[M] = ^E[A,] = ^^Pr(6, 

j=i j=i 




2=1 


-y£Pr(6j ^ a j)' 


e 


Using Chernoff-Hoeffding bound with parameter a = e|| A||p gives: 

^ -2e2. 


Pr 


|i?x|| — ||Ax|| I > e 


< 2 exp 


X =i(/3V^^)PII1 


j=i j=i 


= 2 exp 


Again letting the probability of failure for that xhe, 5 = 1/ 100, we obtain I > ^ \d.{2/5) — ^ 

In order to hold this for all unit vectors x, again we consider a net T of size t = unit directions 

X |65|. Then by the union bound, setting 6 = 1/(100A) this will hold for all such vectors in T, and thus 
we can solve for i = 0((/l^/e^) ln(f)) = OldfP' je^). Hence with probability at least 99/100 we have 
cov-err(A, B) = ^ < e for all unit vectors x. □ 


-2le^ 


< 5 


ln(200). 




C.2 Random Projection Algorithms 

Most of random projection algorithms state a bound (with constant probability) that they can create a matrix 
B = SA such that ||ilx|| = (1 ± e)||Ax|| (e.g. (1 — e)||Ax|| < ||Hx|| < (1 + e)||Ax||) for all x G 
Here we relate this to cov-err and proj-err. We first show that whether this bound is squared only affects 
e by a constant factor. 

Lemma C.3. Fore G (0, j), the inequality ||Hx|| = (l±e)||Ax|| implies ||Hx|p = (1 ± 3e)||Ax|p. 

Proof The upper bound follows since (1 + e)^ = 1 + 2e + < 1 + 3e for e G (0, |). The lower bound 

follows since (1 — e)^ = 1 + — 2e > 1 — 2e for e G (0, |). □ 
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Now to relate this bound to cov-err, we will use p{A) = ||A|||,/||^|| 2 , the numeric rank of A, which is 
always at least 1. 

Lemma C.4. Given a matrix B such that ||-Bx|| = (1 ± £)\\Ax\\for all x G R^, then when e G (0, j) 


cov-err = \\^A — B'^B\\ 2 /\\A\\\ < 3e//?(A). 

Proof. Using Lemma C.3 |||Ax|p — ||-Bx|p| < 3e||Ax|p. Now restrict ||x|| = 1, so then 


x^ {A^ A — B^ B)x = x'^A^Ax — x^ B^ Bx = \ ||^a;||^ — ||f?x||^| < 3e||^x 


T t}T , 


Since this holds for all x such that ||x|| = 1, then it holds for the x = x* which maximizes the left hand size 

so x*'^{A^A — B"^B)x* = — B'^B\\2 so 

\\A^A - B^B \\2 = x*'^{A^A - B^B)x* < 3ePx*f < 3eP|||. 

Dividing both sides by ||^|||' = /9(^)/||^||2 completes the proof. □ 

We next relate this to the proj-err. We note that it may be possible that the full property ||i?x|| = (1 ± 
e)||^x|| may not be necessary to obtain a bound on proj-err, but we are not aware of an explicit statement 
otherwise. There are bounds (see | [^ ) where one reconstructs a matrix A which obtains the bounds below 
in place of rrsf. (^), and these have roughly 1/e dependence on e; however, they also have a factor n in their 
size. 


Lemma C.5. Given a matrix B such that ||i?x|| = (1 ± e)||^a;||/or aZZ x G then when e G (0, j) 

proj-err = p - 7rB^(A)|||/P - Akfp < (1 + 24e). 

Proof. Let V = [ni, ^ 2 ,..., nj be the right singular vectors of A, so that Pfcp ll^ulP and 

\\A - Ak\\p = Yli=k+i It follows from \\Bk\\p > PuP and Lemma ( 

d d 


C.3 


\B-BkfF< Y. \\Bvif< Y 


' Uvif< tAtM-Mf- 


i=k-\-l 


i=k-\-l 


1 - 3e 


1 -3e' 


Let R = [ri, r 2 ,..., r^j be the right singular vectors of B. Then by matrix Pythagorean and Lemma C.3 

d d 

\\A —— Y^ MulP < (1 + 3e)||i?rt||^ = (1 + 3e)||i? — jjfcp 

i=k-\-l i=k-\-l 

l + 3e, 
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1 - 3e 


11^ ~ — (1 + 24e)P — AfcP. 
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